[GH-2824] Add NetCDF metadata extraction to sedonainfo #2829

Draft
jiayuasu wants to merge 10 commits into apache:master from jiayuasu:netcdf-sedonainfo-support

jiayuasu (Member) commented Apr 7, 2026

Summary

  • Add NetCdfMetadataExtractor implementing RasterFileMetadataExtractor
  • Opens NetCDF files via UCAR cdm-core, extracts metadata without reading data arrays (only lat/lon coordinate arrays for spatial extent)
  • Maps data variables to bands, reports dimensions and variables in metadata map
  • Supports .nc/.nc4/.netcdf extensions
  • 7 exact-match tests using test.nc
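
The extension-based selection described in the summary can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not Sedona's Scala code: the names `EXTRACTORS`, `can_handle`, and `pick_extractor` are hypothetical, the GeoTIFF extension set is an assumption based on the file names mentioned in this PR, and case-insensitive matching is also an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional, Set

# NetCDF extensions come from the PR summary. The GeoTIFF set is an
# assumption for illustration only.
EXTRACTORS = {
    "NetCDF": {".nc", ".nc4", ".netcdf"},
    "GeoTIFF": {".tif", ".tiff"},
}


def can_handle(extensions: Set[str], path: str) -> bool:
    """Return True when the path's final suffix is in the given set.

    Matching is case-insensitive here, which is an assumption.
    """
    return PurePosixPath(path).suffix.lower() in extensions


def pick_extractor(path: str) -> Optional[str]:
    """Return the name of the first extractor that accepts the path, else None."""
    for name, extensions in EXTRACTORS.items():
        if can_handle(extensions, path):
            return name
    return None
```

With this sketch, `pick_extractor("/data/test.nc")` selects the NetCDF entry, while a path with an unlisted extension yields `None`.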

Depends on

Test plan

  • Existing GeoTIFF sedonainfo tests pass
  • New NetCDF tests pass (metadata values, geoTransform, cornerCoordinates, bands, overviews, dimensions)
  • Cross-validated spatial extent against RS_FromNetCDF

jiayuasu added 10 commits April 2, 2026 00:29
Add a new Spark DataSourceV2 that returns GeoTIFF file metadata
without decoding pixel data, similar to gdalinfo.

Usage: spark.read.format("sedonainfo").load("/path/to/*.tif")

Returns one row per file with: path, driver, fileSize, width,
height, numBands, srid, crs, geoTransform, cornerCoordinates,
bands (array with dataType, noData, blockSize, colorInterpretation),
overviews, metadata, isTiled, and compression.

Supports glob patterns, directory recursion, LIMIT pushdown,
and column pruning.

… logic, add docs

- Rename package from io.geotiffmetadata to io.sedonainfo
- Extract RasterFileMetadataExtractor trait for format-agnostic design
- Move GeoTIFF-specific logic into GeoTiffMetadataExtractor
- SedonaInfoPartitionReader delegates to format extractors via
  canHandle() dispatch, making it easy to add new formats
- Add documentation page for the sedonainfo data source
- Register in mkdocs.yml navigation

Generate COG files on-the-fly using RS_AsCOG and verify that
sedonainfo correctly reports isTiled=true, non-empty overviews
with proper level/width/height, and blockSize matching the
requested tile size.

…ection

- Replace all inexact assertions (>0, !=0) with exact value matches
  for test1.tiff: width=512, height=517, srid=3857, fileSize=174803,
  band type=UNSIGNED_8BITS, blockSize=256x256, etc.
- Fix overview detection to use DatasetLayout.getNumInternalOverviews()
  instead of getResolutionLevels() which returns synthetic tile-based
  levels even for non-COG files
- Add COG test that generates a COG on-the-fly via RS_AsCOG and
  verifies isTiled=true, 2 overviews, blockSize=256x256

…portsWrite, docs

- Fix isTiled: read TIFF TileWidth tag (322) from IIO metadata instead
  of RenderedImage tile size which reports strips as tiles
- Fix colorInterpretation: derive from TIFF Photometric Interpretation
  tag (262) instead of copying band description. Maps to gdalinfo
  values: Gray, Red, Green, Blue, Alpha, Palette, Undefined
- Fix SupportsWrite: remove mixin, throw UnsupportedOperationException
  in newWriteBuilder since sedonainfo is read-only
- Fix docs: remove false claim about column pruning skipping extraction
- Fix compression: read from TIFF tag 259 description attribute for
  human-readable names (e.g., "LZW", "Deflate")
- Extract TIFF IIO metadata before reader.read() to avoid stream state
  issues
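
The tag-based fixes in this commit can be illustrated with a small standalone sketch. It is written in Python rather than the project's Scala and is not the PR's implementation. It assumes the TIFF tags are already available as a plain dictionary from tag number to value, whereas the commit reads them from IIO metadata. The function names are hypothetical, the compression names are a small subset of standard TIFF codes, and treating a fourth RGB band as Alpha is a simplification (a real reader would need to consult the ExtraSamples tag).

```python
from typing import Dict

# Baseline TIFF tag numbers named in the commit message.
TAG_COMPRESSION = 259
TAG_PHOTOMETRIC = 262
TAG_TILE_WIDTH = 322

# A few standard Compression codes; the display names are illustrative.
COMPRESSION_NAMES = {1: "Uncompressed", 5: "LZW", 7: "JPEG", 8: "Deflate", 32773: "PackBits"}


def is_tiled(tags: Dict[int, int]) -> bool:
    """A file counts as tiled only if it carries a TileWidth tag."""
    return TAG_TILE_WIDTH in tags


def compression_name(tags: Dict[int, int]) -> str:
    """Map the Compression code to a readable name (TIFF default is 1)."""
    code = tags.get(TAG_COMPRESSION, 1)
    return COMPRESSION_NAMES.get(code, f"Unknown ({code})")


def color_interpretation(tags: Dict[int, int], band_index: int) -> str:
    """Derive a gdalinfo-style label from PhotometricInterpretation."""
    photometric = tags.get(TAG_PHOTOMETRIC)
    if photometric in (0, 1):  # WhiteIsZero / BlackIsZero
        return "Gray" if band_index == 0 else "Undefined"
    if photometric == 2:  # RGB; the Alpha slot is an assumption
        names = ("Red", "Green", "Blue", "Alpha")
        return names[band_index] if band_index < len(names) else "Undefined"
    if photometric == 3:  # palette color
        return "Palette" if band_index == 0 else "Undefined"
    return "Undefined"
```

A strip-organized file has no TileWidth tag, so `is_tiled` returns `False` for it regardless of how an image API chooses to expose strips.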
…adata case classes

Make RasterFileMetadata consistent: all nested structures (bands,
overviews, geoTransform, cornerCoordinates) use dedicated case classes.

- Add NetCdfMetadataExtractor implementing RasterFileMetadataExtractor
- Opens NetCDF files via UCAR cdm-core, extracts metadata without
  reading data arrays (only lat/lon coordinate arrays for spatial info)
- Maps data variables to bands (numBands = number of record variables)
- Reports dimensions and variables in metadata map
- Supports .nc/.nc4/.netcdf extensions
- Update glob patterns in SedonaInfoDataSource to include NetCDF files
- Add 7 exact-match tests using test.nc (O3/NO2 variables, 80x48 grid)

Pass requiredFields from Spark's readDataSchema to extractors so they
can skip expensive work (bands, overviews, metadata, compression, CRS
WKT) when those columns are not selected in the query.
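
The idea of skipping expensive work for unselected columns can be sketched as lazy, per-column evaluation. This is a minimal Python illustration, not the PR's Scala code: `extract_row` and the per-column callables are hypothetical, and the real extractors may organize the pruning differently.

```python
from typing import Callable, Dict, Iterable, List


def extract_row(
    required_fields: Iterable[str],
    computations: Dict[str, Callable[[], object]],
) -> Dict[str, object]:
    """Evaluate only the columns named in required_fields.

    Callables for columns that were not requested are never invoked,
    so their cost is not paid.
    """
    wanted = set(required_fields)
    return {name: compute() for name, compute in computations.items() if name in wanted}


# Usage: record which computations actually run.
calls: List[str] = []


def read_width() -> int:
    calls.append("width")
    return 512


def read_bands() -> List[str]:
    calls.append("bands")  # stands in for an expensive step
    return ["band-1"]


row = extract_row(["width"], {"width": read_width, "bands": read_bands})
```

After this runs, `row` holds only the `width` column and `calls` shows that the band computation was never triggered.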
Move NetCDF metadata extraction to a follow-up PR. This PR focuses
on GeoTIFF metadata extraction via the sedonainfo data source.

- Add NetCdfMetadataExtractor implementing RasterFileMetadataExtractor
- Opens NetCDF files via UCAR cdm-core, extracts metadata without
  reading data arrays (only lat/lon coordinate arrays for spatial info)
- Maps data variables to bands (numBands = number of record variables)
- Reports dimensions and variables in metadata map
- Supports .nc/.nc4/.netcdf extensions
- Update glob patterns in SedonaInfoDataSource to include NetCDF files
- Add 7 exact-match tests using test.nc (O3/NO2 variables, 80x48 grid)
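
Deriving spatial information from only the lat/lon coordinate arrays can be sketched as below. This Python example is an illustration under stated assumptions, not the extractor's actual logic: it assumes 1-D, regularly spaced coordinate arrays that hold cell centers, with longitude ascending, and it returns the transform in GDAL's six-element order. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def geo_transform(lon: Sequence[float], lat: Sequence[float]) -> Tuple[float, ...]:
    """Build a GDAL-style geotransform from cell-center coordinates.

    Returned order: (originX, pixelWidth, 0, originY, 0, -pixelHeight).
    Assumes a regular grid and ascending longitudes (both assumptions).
    """
    if len(lon) < 2 or len(lat) < 2:
        raise ValueError("need at least two coordinates per axis")
    dx = (lon[-1] - lon[0]) / (len(lon) - 1)
    dy = abs(lat[-1] - lat[0]) / (len(lat) - 1)
    # Shift from cell centers to the outer edge of the top-left cell.
    origin_x = lon[0] - dx / 2
    origin_y = max(lat[0], lat[-1]) + dy / 2
    return (origin_x, dx, 0.0, origin_y, 0.0, -dy)


def extent(gt: Sequence[float], width: int, height: int) -> Tuple[float, float, float, float]:
    """Return (minX, minY, maxX, maxY) for an unrotated geotransform."""
    min_x, max_y = gt[0], gt[3]
    return (min_x, max_y + gt[5] * height, min_x + gt[1] * width, max_y)


# Usage: a 4x3 grid of one-degree cells.
lon = [0.5, 1.5, 2.5, 3.5]
lat = [2.5, 1.5, 0.5]
gt = geo_transform(lon, lat)
bbox = extent(gt, len(lon), len(lat))
```

Only the two coordinate arrays are touched here; no data variable needs to be read to obtain the transform or the bounding box.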